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Abstract. 

Nonparametric and nonlinear measures of statistical dependence between pairs 
of random variables are important tools in modern data analysis. In particular 
the emergence of large data sets can now support the relaxation of linearity as¬ 
sumptions implicit in traditional association scores such as correlation. Here we 
describe a Bayesian nonparametric procedure that leads to a tractable, explicit and 
analytic quantification of the relative evidence for dependence vs independence. 

Our approach uses Polya tree priors on the space of probability measures which 
can then be embedded within a decision theoretic test for dependence. Polya tree 
priors can accommodate known uncertainty in the form of the underlying sampling 
distribution and provides an explicit posterior probability measure of both depen¬ 
dence and independence. Well known advantages of having an explicit probability 
measure include: easy comparison of evidence across different studies; encoding 
prior information; quantifying changes in dependence across different experimental 
conditions, and; the integration of results within formal decision analysis. 

Keywords: dependence measure, Bayesian nonparametrics, Polya tree, hypothesis 
testing 

1 Introduction 

Quantifying the evidence for dependence or testing for departures from independence 
between random variables is an increasingly important task and has been the focus of 
a number of studies in the past decade. A typical motivating example comes from the 
field of biology where a growing abundance of genetic, proteomic and transcriptomic 
data is being produced. In order to unravel the existing relationships between different 
molecular species (genes, proteins, ...) involved in a biological system, large datasets 
are commonly screened for evidence of association between the pairs of variables. This 
requires adequate statistical procedures to quantify the evidence of dependence (or lack 
of independence) between two samples of typically continuous random variables. 

In this article, we propose a Bayesian nonparametric procedure to derive a proba¬ 
bilistic measure of dependency between two samples x and y without assuming a known 
form for the underlying distributions. In particular let Ado denote a model, or hypoth¬ 
esis, of independence and Adi a model, or hypothesis, of dependence. The posterior 
probability, p(Ad i\x, y), is then a natural measure of the strength of evidence for depen- 
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dence between the two samples against independence. The Bayes Factor quantifying 
the relative evidence in the data in favour of Adi over Ado is simply, 

= p(x,y\Mi) 
p(x\M 0 )p(y\M 0 )’ 

which is the ratio of the prior predictive probability of the observed data given the two 
competing hypotheses. This Bayes Factor implicitly avoids conditioning on the form of 
the unknown distribution functions. The role of Bayesian nonparametrics is to allow 
one to accommodate this uncertainty via a prior measure on the space of probability 
measures, for instance, 


p(x,y\Mi) = J f(x,y\Mi)n(dF\Mi) 


where 7r(-) is a Bayesian nonparametric prior with wide support over the space of prob¬ 
ability measures on the joint sample space fix x fix; see for example Hjort et al. (2010). 

We use Polya tree priors ( Lavine|[l992 Mauldin et al.|[l992 Lavinep994 1 to model 
the unknown distributions of the data. We show that the use of such priors leads to an 
analytic derivation of our association measure p(A4±\x, y). In particular, this measure of 
dependence involves a finite analytic calculation though the Polya tree prior is defined 
over an infinite recursive partition. Polya tree priors have previously been used to de¬ 


rive Bayesian nonparametric procedure for two sample hypothesis testing (Holmes et al. 
2015 Ma and Wong 2011) and extensions of these priors have been proposed to model 


distributions indexed by covariates (Trippa et al. 2011). The “two-sample testing” prob¬ 
lem is different to that considered here in that it considers the same measurement, or 
outcome, Y, measured on independent samples under different conditions and tests for 
evidence of the “treatment” or covariate effect, whereas our paper is concerned with 
exploring evidence for statistical association between two joint measurement variables, 
{Y. X }, recorded together on a set of samples. However, our approach exploits a similar 
framework to the testing procedure from Holmes et al. (2015). In particular, our asso¬ 
ciation measure necessitates the construction of Polya tree priors on a two-dimensional 
space, and as discussed at the end of the paper, this engenders new challenges regarding 
the partitioning scheme to be employed. It is worth noting that Polya trees offer a 
more appropriate nonparametric model than say Bayesian histograms, with Dirichlet 
priors (Leonard 1973), as the recursive tree structure of the Polya tree is indexed on 
the measurement variable, whereas the Dirichlet prior for histograms is for unordered 
categorical data and local dependence between measurement bins must be introduced 
via a hierarchical prior. Moreover the Polya tree is defined via an infinite sequence 
partitioning, bypassing the need to truncate at some level, and, as noted above, our 
approach can compute the Bayes factors from the infinite sequence. 


Numerous frequentist approaches have been developed for identifying associations 


between two samples ( 

Shannon and Weaver 

1949 

Cover and Thomas 

1991 

Reshef 

etahpOll 

Gretton and Gyorfi 2010 

1 but to the best of our knowledge this is the first 


Bayesian nonparametric procedure to quantify of the relative evidence of dependence 
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vs independence. Being able to provide an explicit probability of dependence is attrac¬ 
tive for a number of reasons. First, it can be combined with a variety of probabilistic 
approaches. In particular, it can be easily merged with the decision theory framework 
in order for optimal statistical decisions to be made in the face of uncertainty. Another 
important property of probabilistic measures is their interpretability. Indeed, a given 
level p = p(Mi\x, y ) of this measure is exactly the probability of a dependent generative 
model given the data and the probability of an independent generative model is simply 
1 — p. Over and above the standard arguments in favour of Bayesian inference, one 
explicit consequence of the coherence is that we can explicitly quantify the evidence 
for a change in dependence between two variables across two or more conditions. For 
example, if there is evidence that two dependent variables {X 7 Y} become independent 
on application of a treatment, or across disease states. Answering such questions is 
problematic from a non-Bayesian perspective, as a null hypothesis of dependence is of 
higher dimension than the corresponding alternative hypothesis of independence which 
is nested under the null. This makes the quantification of a p-value extremely challeng¬ 
ing. In Bayesian analysis, the symmetry of the model space makes for a simple and 
intuitive solution. In Section 4 we illustrate this issue using an important application 
in cancer genetics. 

The remainder of the paper is organised as follows. We first introduce the Polya 
tree priors in Section [2] and summarise their main properties. In section 3, we describe 
our nonparametric procedure to test for dependencies between two samples. We then 
illustrate in section 4 our approach on data generated from simple models and then 
we apply our procedure to two real world problems from biology. In the Appendix we 
provide an empirical calibration comparing our method to that of other non-Bayesian 
approaches in the literature. 


2 Polya Tree priors 


Polya trees form a class of distributions for random probability measures F on some 
domain Q (Ferguson 1974) by considering a recursive partition of Q into disjoint mea¬ 
surable spaces and constructing random measures on each of these spaces. A binary 
recursive partitioning is typically used for one-dimensional domains: Q is divided in 
two disjoint sets Cq and C\ which themselves are divided in two other disjoints sets 
C o = Coo U Coi and C\ = Cm U Cn, and so on. The infinite recursive partition is 
denoted by C = {Cj,j = 0,1, 00, 01,10,11,... }; the partition at level k is comprised of 
2 k sets Cj where j are all binary sequences of length k. 


To better understand the probability measure constructed on these nested partitions, 
one can think of a particle going down the tree shown in Figure |T| Left; at each junction 
j, usually represented in binary format, the particle has a random probability 6 :j to 
choose the left branch. In Polya trees, the random branching probability follows a Beta 
distribution, with 0j ~ Beta(ayyo), 0 ^( 1 )). Given the partition C, the sequence of non¬ 
negative vectors A = {a^yo), and the sequence of realisations of the random 

branching variables 0 = {0j}j, it is possible to compute the likelihood for any set of 
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observations x: 


P (x\e,c,A) = l[d;’ o (i~6 J r^ 


(i) 


3 


where the product is over the set of all partitions and rijo and riji denote the number 
of observations that lie in the partitions Cjo and Cj\ respectively. The Beta prior on 
the partition probability is conjugate to the Binomial likelihood and, integrating out 9j 
for all j, we obtain that 



( 2 ) 


where £?(.,.) refers to the Beta function. For more details on Polya Tree priors, we refer 


the reader to Ferguson ( 

1974 

i; 

Lavine 

(1992) 

; Mauldin et al. 

(1992 

); 

Lavine 

(1994 

Ghosh and Ramamoorthi 

(2003) 

Wong et al. 

2010 

)• 



In this paper, we are interested in testing independence between two samples x and 
y. We therefore need to consider the joint space f lx x fly of the two sample spaces. For 
reasons that will become obvious later on, we recursively subdivide this space into four 
rectangular regions. We start with partitioning fl\ x in 4 quadrants, fix x fly = 
CoUCiUC^UCa, and continue with nested partitions defined by Cj = CjoUCjiUCj 2 CCj 3 
for any base 4 number j. Thus the partition at level k is formed of 4 fe sets Cj where j are 
all quaternary sequences of length k. We assume that the sets Cj are rectangular, i.e. 
can be written as a Cartesian product DxE where D C fix and E C fly . We arbitrarily 
choose to denote Cj 0 the left bottom region of the set Cj, Cji the right bottom region, 
Cj 2 the left top region, and Cj 3 the right top region for all j. This recursive partition 
C = {Cj,j = 0,1,2,3, 01, 02, 03,11,... } is illustrated in Figure[lj3. Similarly to the one¬ 
dimensional case, a probability measure can be constructed on this recursive partition 
by defining random branching probabilities in the recursive quaternary partition. In 
the following we will use different distributions for these branching probabilities. 

3 A Bayesian nonparametric measure of dependence 

3.1 The approach 

Given a N sample (x,y) which are i.i.d. realisations of the random vector (X,Y), we 
wish to evaluate the evidence for the competing hypotheses: 


Mo '■ X and Y are independent random variables; 
Mi : X and Y are dependent random variables. . 


We denote by F\y the unknown joint probability distribution of (X, Y) and by Fx and 
Fy the two unknown marginal distributions. Following a Bayesian approach, we aim at 
estimating the posterior probability 


p(Mi\x,y) oc p(x, y\Mi)p(Mi) 
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Two-dimensional case 
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Figure 1: (Left) Construction of a Polya tree distribution in the uni-dimensional case: 
at each junction j the particle has a random probability Oj to choose the left branch 
and 1 — Oj to choose the right one. (Right) Illustration of the first two levels of the 
quadrant partitioning scheme in the two-dimensional case. 
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where p(A4 1 ) represents prior beliefs regarding the competing hypotheses. We specify 
our uncertainty in the distribution Fxy via a Polya tree prior. Denoting by and 
f ly the domains of the probability measures Fx and Fy respectively, we consider a 
recursive quaternary partition of klx x fly into disjoint measurable sets as described in 
the previous section. 

Under model Ado, we assume that samples x and y are independent. We can there¬ 
fore think of the partitioning in terms of x-axis and y-axis separately. Let £ h x and £yy 
denote the independent random branching probabilities which determine the probability 
of going in the “left” region of Cj (i.e. Cj o U C^) and the“bottom” region of Cj (i.e. 
Cjo U Cj!) respectively. Similarly to the one-dimensional case, we assume that the ran¬ 
dom branching probabilities follow Beta distributions, ~ Beta(ayx,(o)> a j,x,(i)) and 
£j t Y ~ Beta(aj y,(o)j a j,y,(i))- By independence of ^x and £j,x, the likelihood of the 
data given the partition C, the sequence of random branching variables S = {^j,x,^j,y}j, 
Ax = {a j,x,(o), a j,x,(i)}j an d Ay = {(*j,Y,(o)i a j,Y,(i)}j can be computed as follows 


p(x,y\Z,C,Ax,A Y ,Mo) = H^'^O - ” ‘Cv ' ”'’ 1 1 ~ Zsr) 


\Uj 2+n j3 


where, for quaternary sequence j £ {0,1, 2, 3, 01,..., 03,..., 31,..., 33,... }, nj is the 
number of observations falling in Cj. Integrating the random branching probabilities 
out, we have 


p(x,y\C,A x , Ay, M 0 ) = ]J 


B(n,jo + rij2 + ajx,(o)i n ji + n j3 + a j,x,( i)) 
B(atj t x,( o)i&j,x,(i)) 

B(jijo + riji + &j,y,( o)j n j2 + nj3 + a j,Y,{ i)) 
B( a j,Y,(0)> a j,Y,(l)) 


( 3 ) 


Under the Adi hypothesis, we do not assume independence between samples x and y. 
In this case, for each set j, the random branching probability 9j = {9j,(o),9j,(i),9j,(2),9j,(3)) 
is a random vector taking values in the simplex S 3 . For every i £ {0, 1, 2, 3}, Oji is the 
probability that the particle falls in the quadrant C ; j. We use Dirichlet distributions 
with parameters ay = (awo), otj,(i), a j,( 2 )i a j,(3)) f° r the random branching variables 
{9j}j (Hanson|[2C)061. Hence the marginal likelihood is 


p{x,y\C,A,Mi) = U 


Bjnj+aj) 

B{otj) 


( 4 ) 


where fij = (njo,nji,rij 2 ,Tijs) and B is the multinomial Beta function defined as 


B{aj) = 


IL=o B(Qj,(j)) 


where F designates the Gamma function. 

Typically the values for the a h a) are of the form ck 2 for a parameters at level k 
(Walker and Mallick 19991; we recall that k is the depth of the set Cj and the length 
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of the quaternary sequence j. We will follow this convention so that aj = O-jj-ij for 
i £ {0,1,2, 3}. In addition, to ensure that the prior distributions are equivalent under 
the two models, we will assume that, for all j, aij,x,( o) = a j,(o) + a j,( 2 ) = 2 aj, <%j,x,( l) = 
+ a j,( 3) = 2a j, 0!j,Y,( o) = a j,(o) + a j,( l) = 2 a,j and aj t y,( i) = ay,( 2 ) + a j,(3) = ^ a j■ 

To compare evidence in favours of both hypotheses, we compute the following ratio 

p(M 0 \x,y) = p(x,y\M 0 ) p(M 0 ) 
p(Mi\x,y) p{x,y\Mi)p(Mi) 


where the first term is the Bayes factor which can be written as a product over all 
partitions: 


p(x,y\M 0 ) = tt, 
p(x,y\Mi) 3 


(5) 


where bj is defined below. From equations ([3]) and Q and expressing Beta and multi¬ 
nomial Beta functions in terms of Gamma functions, we have 


bj = 


r (n j0 + rij 2 
T(n i0 + riji - 
r(4a j )r(a i 

X . , . 


h 2aj)T(riji + n,j 3 + 2aj)T(n,j 0 + riji + 2aj)F(nj 2 + rij 3 + 2a 3 ) 
rij 2 + rij 3 + Faj)T(nj 0 + aj)T(riji + aj)r(n j2 + aj)T(rij 3 + a 3 ) 


The product in ([5]) is defined over the infinite set of partitions. However for any set Cj 
containing zero or one data point (i.e. such that rij = n,jo + nji+rij 2 + nj 3 < 1 ), bj = 1. 
Therefore, only subsets with at least two data points contribute to this product. The 
Bayesian measure of the strength of evidence for dependence between the two samples 
against independence involves a finite analytic calculation even though ([5]) is over the 
infinite number of levels in the tree. 


The procedure is described in Algorithm]!] For each set Cj containing more than one 
datapoint, the term bj measures the relative evidence in favour of given the number 
of datapoints falling in each of the four quadrants of Cj. Intuitively, for each set Cj we 
perform a Bayesian independence test based on the local two-by-two contingency table. 
Polya tree priors provides us with a theoretical framework to perform these “local” 
independence tests at every level while taking into account potential dependences on 
neighboring sets. In addition, it allows us to compute the Bayes Factor analytically 
without having to chose any arbitrary level or any truncation. The parameter aj which 
decreases with the depth of the set Cj enables us to give more importance to “local” 
independence tests at the higher levels than at the deepest levels. In the next subsection, 
we investigate the impact of the choice of this parameter. 


3.2 Sensitivity to choice of A 

The proposed procedure relies on a choice of the sequence of non-negative vector A, 
Ax and Ay- As discussed above, the a parameters are constant per level and such that 
oijto = ck 2 where k is the depth of the set Cj. In addition, aj x,(i) = a j,Y,(i) = 2 ck 2 . 
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Algorithm 1 Bayesian nonparametric evidence for independence 

1. Fix the quadrant partitioning scheme; choose a constant c. 

For every set Cj containing more than one data point, compute bj defined in § 
with cij = ck 2 where k is the depth of the set Cj. 

2. Assuming equal prior belief for both hypotheses, 

p(Mi\x,y) = 1 + ^ b = l-p(M 0 \x,y) 


The parameter c controls the speed of divergence of the a parameters with the depth 
k and therefore the relative contribution of each level of the partition to the Bayes 
Factor. We have investigated the impact of the setting of c (see Figure SI and S2 in 
Supplementary Material) and observe that small values of c typically favours the simpler 
model (Mo) especially when the number of samples is small and there is not enough 
evidence to determine Mi. This is to be expected as Bayesian modelling encompasses 


a natural Occam factor in the prior predictive (see for example chapter 28 of MacKay 


(2003)). We have found c = 5 to be a reasonable canonical choice but practitioners are 
strongly advised to explore the setting for their own analysis. 


3.3 Choice of the partition 

Basic approach: partition centred on the median of the data 


The inference resulting from a Polya tree model is known to strongly depend on the 
specification of the partition C over the data space (Paddock et al. 2003), and a multitude 
of quadrant partitioning scheme could be used in our procedure. As a default, partially 
subjective approach we suggest to construct a partition based on the quantiles of two 
normal distributions (for the x- and y-axis respectively). In other words, both variables x 
and y are transformed through the inverse cumulative distributive function of a normal 
distribution; a quaternary recursive partition of [0,1] x [0,1] is then constructed by 
subdividing it into four rectangular quadrants of equal size (see Figure [2j Left) The 
mean and standard deviation of the normal distributions can be derived from empirical 
estimates of the location and spread of the two samples. In the next section we use the 
median and the median absolute deviation as this choice induces robustness to potential 
outliers. 


A simple partial optimizing procedure for partition centering 

Two random variables X and Y are dependent if and only if the distribution of Y 
conditional on X £ T>x is different to the distribution of Y conditional on X £ V ^ for 
some T>x a compact set of fix and 2?^- is its complement. In the previous paragraph, we 
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Basic partition 


\J/ 

Normalized data 



Original data 



Shifted partition 


M/ 

Shifted data 



Normalized data 



Figure 2: Construction of the partition. Both the basic approach and the shifted 
approach are illustrated on a simulated sinusoidal dataset with some outliers. Under the 
basic approach (Left column), the data are marginally transformed via the inverse of 
the c.d.f. of normal distributions. The shifted approach consists in shifting the central 
location of the partition by a factor <5 and wrapping the data around: the data space 
is divided into two regions Z\ = {(x,y),x < <5} and Z 2 = {(x,y),x > <5} which are 
then juxtaposed. The obtained “shifted” data are then normalised via the inverse of 
the CDF of a normal distribution. Quaternary recursive partitions of [0,1] x [0,1] are 
constructed by subdividing the normalized data into four rectangular quadrants of equal 
size (Bottom panels). 
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suggest centering the partition on the median ( mn x ,m y ) of the data. For such choice of 
partition, at the top-level of the Polya tree our procedure tests whether the distribution 
of Y conditional on X £ T> x = (—00, m x ) is equal to the distribution of Y conditional on 
A' £ The procedure performs the test symmetrically on the x-axis and the y-axis. 
Instead of focusing on the median, it would be more informative to test whether the 
distribution of Y conditionally on X £ T> x is equal to the distribution of Y conditionally 
on A £ for any compact set T> x £ fix- 


In this section, we consider a simple partial optimisation of the partition-centering 
location by considering different compact sets T> X - The approach involves shifting the 
central location of the partition as defined by the top-level split, and wrapping the data 
round to maintain balance in the number of points in each region. Consider a real 
number 5 £ [a, b] where a and b are respectively the minimum and maximum values of 
the data on the x-axis. We denote by ipg the transformation that divides the data space 
in two regions Z\ = {(x,y), x < 6} and Z^ = {(x,y),x > <5} and juxtaposes them as 
illustrated in Figure [2j Right. More formally, ipg ■ [a, b] x fly — > [<5, b — a + <5] x fly such 
that 


*Ps(x,y) 


(b — a + x,y) if x < 6, 
(x,y) otherwise. 


The obtained “shifted” data are then transformed through the inverse cumulative dis¬ 
tributive function of normal distributions and a quaternary recursive partition of [0,1] x 
[0,1] is constructed as in the basic approach. We denote the obtained partition by Cg. 

We consider optimising the marginal evidence of dependence p(Adi |a;, y) by maxi¬ 
mizing the Bayes factor as defined in equation <[5j) over all the partitions Cg for S £ [a, b]. 
The obtained probability of dependence is therefore 


p(Mi\x,y) 


1 

1 +Bg' 


where Bg designates the Bayes Factor given the partition Cg. This approach is called the 
“empirical Bayes approach” in the rest of the paper. Note that optimising the central 
location of the partition with respect to p(Mi\x,y) will naturally tend to inflate the 
evidence for Adi. However, when testing many pairs of random variables for evidence 
of dependence we are mainly concerned with the ranking of the pairs for further anal¬ 
ysis, rather than explicit quantification of the evidence, and partial optimisation of the 
partition may well help to produce more stable and acurate rankings. 


4 Applications 

In this section, we illustrate the performance of our Bayesian nonparametric procedure 
for detecting dependence across different datasets. We first test the procedure on simple 
models proposed by Kinney and Atwal ([2014) and then apply it on two real examples 
from biology. 
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We apply our Bayesian procedure on datasets generated under 5 different models pro¬ 
posed by Kinney and Atwal (20141 as illustrated in Figure [3j Top row: a linear model 
(y = 2x/3 + r/), a parabolic model ( y = 2a; 2 /3 -F 77), a sinusoidal model (y = 2 sin(a;) +77), 
a circular model (x = 10 cos(9) + y and y = 10 sin(0) + rf) and a model called “checker¬ 
board” (x = 10 (i x + 9) + 77 and y = 10 (i y + 9) + 1 7 where i x ~ U{{ 0,1,2,3}) and 
i y = mod(2u, i x ) with u ~ W({0,1})); in addition 9 ~ U{[0 , 2i r]) and for each model we 
have i.i.d noise variables rj ~ Af(0, o 2 ). 


We generated 500 independent data sets of size N for each of the 5 models, and varied 
the level of noise (a), and the number of data points (N). We used our procedure 
to compute the probability of the hypothesis Mi given each of these datasets. The 
partition structure was set using the robust mean and standard deviation of the data, 
and the parameter c was set equal to 5. The frequency distribution (over the 500 
independent runs) of the probability of the hypothesis M 1 for each model as a varying 
number of data points and the level of noise is shown in Figure [3] (Middle rows). The 
red curve represents the median while the light and dark grey area designate the zone 
between the 5th and 95th percentiles and the inter-quartile region respectively. 


As expected, the probability that the two samples are dependent is equal to 0.5 for 
every model when JV = 1 as we assumed equal prior belief of both hypotheses. The 
probability of A4i increases as the number of data points increases and is very close 
to one for every model if N is larger than 4000. It is interesting to note that when 
the number of samples is small there is not enough evidence to determine Mi and the 
Bayes Factor may favour the simpler model Mo- As mentioned previously, this it to 
be expected as Bayesian modelling encompasses a natural Occam factor in the prior 
predictive (see for example chapter 28 of MacKay (2003)). This effect is stronger for 
other smaller values of the parameter c (see Figure SI and S2 in the Supplementary 
Material). 

The logarithm of the Bayes Factor defined in equation © can be decomposed in 
terms of levels in the recursive partition as follows 


log 


f p(z, 1/1-Ado) \ _ y- 
\p(x,y\Mi)J 4^ 


E 


!og (bj) 


ij s.t.Cj in level k 




and the contribution of each level in favour of the independence or dependence model 
(denoted by Bk for the level k) can be investigated. When Bk is close to 0, the contribu¬ 
tion of level k is negligible, whereas large positive (resp. negative) B indicates stronger 
evidence in favour of independence (resp. dependence) at level k. In Figure [3] (Bottom 
row), we show the distribution (over 500 independent runs) of B *. for k £ {1,2,3,4, 5} 
for a sample of N = 150 data generated from the 5 different models with o = 2. We 
observe that in the linear model, the dependence can already be detected at the first 
level, with B\ being strongly negative for most of the generated datasets. However, 
for the four other models, the value of B\ is mostly positive and the top level does 
not contain enough information to detect that there are dependencies between the two 
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variables. In these examples, most of the information in favour of the dependence model 
is in the second level. Deeper levels contribute less to the decision; this is due to the 
form of parametrisation with the a parameters being proportional to k 2 . This decom¬ 
position of the evidence across levels is an attractive qualitative feature of the Polya 
tree testing framework, which can assist the statistical analyst in better understanding 
the dependence structure. 

The symmetric nature of the probability, p(Mo\-) = l—p(Mi\-), allows us to explore 
the ability to detect independence. The probability of the independent hypothesis Mo 
given data sampled from two independent standard normal distributions is shown in 
Figure [4] for increasing number of data points (N). The probability of the independent 
hypothesis increases with N and is very close to one if N is larger than 500. Such a 
measure of independence between variables is problematic to compute for non-Bayesian 
methods, as we are testing for a simpler model nested within a more complex one. 
Frequentist approaches use a p-value which is conditional on Mo being true. Hence as 
it stands it cannot be used as evidence for Mo- 

The performance of our Bayesian nonparametric approach varies between models 
both in terms of number of data points required to detect dependence and in terms of 
noise sensitivity: dependence’s are detected for the circular and checkerboard models 
even for relatively high levels of noise and relatively small number of data points but 
less so for the linear model, which visually appears closer to independence (top plots 
in Figure [3]). In addition, our approach necessitates a relatively high number of data 
points (N > 300) to detect dependence in the parabolic or the sinusoidal models even 
for a level of noise cr = 2. The lack of dependence detection on these two last models 
may be due to the symmetry of the generated data relative to the choice of the parti¬ 
tioning scheme. To overcome this issue, we suggest to change the partitioning scheme 
by optimising the partition centering as described in section |3.3| Figure [5] shows that 
the evidence for independence is strongly increased for those two models when running 
the empirical Bayes approach which consists of maximizing the marginal probability 
under the dependent model over the shifted partition scheme with data wrapping. As 
expected, this empirical Bayes approach inflates the probability of dependence for every 
models included when the data are generated using an independent model. 


4.2 Applications from molecular biology 

Gene expression network form measurements at single-cell resolution 


The field of biology contains numerous examples where a large amount of data has been 
produced and adequate measures to detect dependence between variables are required. 
Here we focus on an example from single-cell biology. Nowadays, the expression level of 
thousands of genes can be jointly measured at single-cell resolution, which allows biolo¬ 
gists to precisely study the functional relationships between genes. In Wills et al. (2013), 
the expression of 96 genes affected by Wnt signaling have been measured in 288 single 
cells. The authors provided evidence that many of these transcriptomic associations 
are masked when expression is averaged over bulk sequencing on many cells. In their 





S. Filippi and C. Holmes 


13 


linear 



parabolic 



sinusoidal 



circular 



checkerboard 


.Sit i 

i2* ; V'-v v 


1 

0.5 

0 




N N 











Figure 3: (Top row) Illustration of the five test data sets and sampling distributions 
with N = 300 data and noise parameter o = 2. (Middle rows 2 and 3) Frequency 
distribution (over 500 independent runs) of the probability of the hypothesis Mi for 
each model varying the number of data points (N) in plot row 2 and the level of noise 
(ct) in plot row 3. When varying the number of data points, the level of noise is fixed 
at <r = 2; when varying the level noise, the number of data points is fixed to N = 300. 
The red curve represents the median while the light and dark grey area designate the 
zone between the 5th and 95th percentiles and the inter-quartile region respectively. 
(Bottom row) Distribution (over 500 independent runs) of the contribution ( Bk ) of the 
5 first levels in the Polya Tree. A negative Bk indicates evidence against independence. 
We set N = 150 and o = 2. 
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Figure 4: (Left) Illustration of the independent models sampling 300 data from two in¬ 
dependent standard normal distributions. (Right) Distribution (over 500 independent 
runs) of the probability of the hypothesis Mo varying the number of data points ( N ). 
The red curve represents the median whereas the light and dark grey area designate the 
zone between the 5th and 95th percentiles and the inter-quartile region respectively. 


Polya Tree 
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Figure 5: Distribution (over 500 independent runs) of the probability of dependence 
using our Bayesian nonparametric approach (Left) and our empirical Bayes approach 
which maximises the marginal probability of dependence over the shifted partition 
scheme with data wrapping (Right). Data are generated under the 5 illustrative ex¬ 
amples as well as an independent generative model where both x and y are vectors of 
i.i.d. samples from a normal distribution with mean 0 and standard deviation 1. Here, 
N = 150 and cr = 2. 
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study the authors investigated the relationships between these genes and constructed 
an expression network using the measurements at single-cell resolution. The expres¬ 
sion network can illustrate potential functional relationships and dependencies that are 
interesting to elicit molecular pathways. The network is constructed by highlighting 
genes that have correlated or anti-correlated expressions between cells using Spearman 
correlation coefficient. We reproduce this network detecting dependences both with 
Spearman correlation and with our Bayesian procedure (see Figure [6] Top row). Our 
procedure detects many associations between genes that were not detected by simple 
correlation analysis: More than 250 pairs of genes have a probability of dependence 
higher than 0.95 and an absolute value of Spearman correlation lower than 0.5 whereas 
every (except one) link with absolute Spearman correlation higher than 0.5 has a prob¬ 
ability of dependence larger than 0.95 (see Figure [6] Bottom row, Left). Some of these 
links that are only detected using our approach are between genes that are known to 
interact such as APC and DVL2, AXIN1 and GSK3B, DVL2 and LRP6 or AXIN1 and 
DACT1 (see Figure [6] Bottom row). Other detected links remain to be investigated. 


Differential co-expression analysis 


Networks have proved themselves to be important representation of biological systems 
where various molecules are interacting and functionally coordinating. A typical exam¬ 
ple is gene expression networks such as the one described in the previous subsection 
where nodes correspond to genes and edges represent interactions between genes. Inter¬ 
actions in biological networks can substantially change in response to different condi¬ 
tions. In particular, gene co-regulations may be altered with disease and an interaction 
between two genes could be present in some conditions and not in other. Differential co¬ 
expression analysis consist in identifying which interactions in gene expression network 
change from one condition to another (Hsu et alj2015). 


The main objectives of differential co-expression analysis is to identify couples of 
genes (x, y) such that the strength of dependence between x and y changes in response 
to different conditions. Our Bayesian procedure is perfectly suited for this type of 
problems which require methods able to detect both dependences and independences. 
Non-Bayesian testing procedures for independences typically only provide p-values to 
identify when the null hypothesis (here, the independence hypothesis) can be rejected. 
To the contrary our approach enables us to quantify the relative evidence of dependence 
vrs independence. In particular, given the expression of two genes in n 

cells under condition A and the expression {ayi}i=i,-. .,n of the same two genes in h cells 
under another condition B, we can calculate the probability of a change of interactions 
between these two genes in response to conditions A and B as follows 


Pdiff({*U, ,n; {-U; Vi\i= 1,...,n) — 


(7) 


p(Afi|{a:i, yi}i) {l - p{Mi\{x il y i }i)) +p(Adi|{i i ,yJ i ) (1 - p{Mi\{xi, yjj) . 


In Curtis et al. (2012), a collection of around 2,000 breast cancer specimens from 
tumour banks in the UK and Canada is analysed and compared to a set of 144 normal 
cells. We propose to apply our algorithm to these gene expression dataset and make use 
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Figure 6: (Top row) Expression network constructed using correlation (Left) - where 
links with absolute correlation larger than 0.5 are shown and darker edges represent links 
with absolute correlation larger than 0.7 - and our nonparametric Bayesian procedure 
(Right) - where links with probability of dependences larger than 0.99999 are shown and 
darker edges represent links with probabilities equal to 1 . Genes are ordered clockwise 
according to increasing number of detected links. (Bottom row - Left) Comparison of 
the probability of dependence computed following our approach and the absolute value 
of Spearman correlation for every pairs of genes. The 4 red circles indicate some pairs 
of genes with known interactions which have an absolute correlation lower than 0.5 but 
a probability of dependences larger than 0.95. (Bottom row - Middle and Right) 
Examples of gene expression data for two genes with known interaction. 
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of the probability in equation 0 in order to identify dysregulation in gene expression in 
response to breast cancer. We focus on comparing a subset of 997 tumour cells (called 
the discovery test in Curtis et al. (2012)) with the set of 144 normal cells; for each cell, 
the expression level of 48,803 probes is available. Following the approach proposed by 


Langfelder and Horvath (2007) we use the gene expression of the normal cells to identify 
25 modules of correlated genes and determine so called “eigengenes” that represent the 
expression of genes in each module. We used the R implementation of this module 


detection and eigengene computation provided in the R package WGCNA (|Langfelder 


and Horvath (2008)). We represent in Figure [7] Left the eigengene expression of some 


selected modules under the two conditions. By computing the probability in ([T]) for the 
300 pairs of modules, we identify interactions between modules that significantly change 
in response to breast cancer. We find that the probability pdiff is larger than 0.95 for 
69 module interactions, represented in Figure [7] Right. Among those 69 interactions, 49 
are interactions that were present in normal cells and vanished in tumour cells whereas 
13 of the interactions only appear in tumour cells. 


The pathway enrichment analysis enables us to implicate each of the 25 modules 
with established biological cascades and clinically-relevant pathways (see Table [I]). The 
two modules with the highest degree in the differential co-expression network are: (a) 
the ALK1 signaling pathway, and (b) complexes associated with translation (e.g. ribo¬ 
some). For the former, this likely indicates loss of regulation and signaling cross-talk; 
for the latter, since ribosomes are essential and translational genes are often tightly reg¬ 
ulated, this hints at wide-spread transcriptomic perturbation. In particular, the high 
degree of the ribosome/translation module indicates that, in the cancerous state, more 
modules are increasingly dysresgulated and out of sync with the more tightly regulated 
translation-involved modules. Our analysis shows that both of these high-degree mod¬ 
ules are disconnected to numerous other important pathways such as the ERK/MAPK 
pathway, or genes involved in immune signalling (such as antigen presentation). More 
generally, this demonstrates that overlaying biological and clinically relevant annota¬ 
tions on the differential co-expression network may be the basis for further research 
regarding transcriptomic alterations in breast tumors. 


5 Discussion/Conclusion 

We have presented a novel Bayesian nonparametric approach that quantifies a proba¬ 
bilistic measure for the strength of evidence for dependence between two samples against 
that of independence. The procedure is based on Polya tree priors that facilitate an 
analytic expression for the Bayes factor even though the Polya tree prior is defined over 
an infinite recursive partition. We have applied our approach to simulated datasets as 
well as applications in molecular biology including single-cell gene-expression analysis 
and network analysis in cancer genetics. 

The inference resulting from a Polya tree model is known to strongly depend on the 
specification of the partition C over the data space. We have proposed an empirical 
method to select the partition centring by optimising the marginal likelihood in favour 
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Figure 7: (Left) Expression of eigengenes for different modules (module 2, 3,4, 10 and 
13) in the 144 normal cells and in the 997 tumor cells. We observe that some modules 
are strongly interacting under one of the conditions (Normal or Breast Cancer) and 
not under the other. Each dot corresponds to the expression of two eigengenes in one 
cell. (Right) Differential co-expression network: nodes correspond to modules of genes; 
edges represent module interactions that significantly change between normal and cancer 
conditions. Red continuous edges correspond to interactions that are present in normal 
cells and vanished in tumour cells; blue dashed edges correspond to interactions that 
are only present in tumour cells. 
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Module Number 

Over-represented Pathway(s) 

1 

Cell Cycle, Mitotic 

2 

ATP sensitive Potassium channels 

3 

Gene Expression 

4 

Elastic fibre formation 

5 

IFN-alpha/beta pathways + Interferon Signaling* 

6 

ALK1 signaling events 

7 

Translation + Ribosome* 

8 

Bladder cancer - Homo sapiens (human) 

9 

Apoptotic cleavage of cell adhesion proteins 

10 

ERK/MAPK targets 

11 

Extracellular matrix organization 

12 

AP-1 transcription factor network 

13 

Metabolism 

14 

Generation of second messenger molecules 

15 

The citric acid (TCA) cycle and respiratory electron transport 

16 

Thromboxane signalling + ADP signalling 

17 

Peptide ligand-binding receptors 

18 

Generic Transcription Pathway 

19 

Type I hemidesmosome assembly 

20 

Peptide chain elongation 

21 

PI3K-Akt signaling pathway - Homo sapiens (human) 

22 

Eukaryotic Translation Elongation 

23 

Immune System 

24 

Metabolism 

25 

Mitochondrial translation elongation 


Table 1: Pathway enrichment results, using three pathway databases (PID, KEGG, 
and Reactome), for the 25 modules identified. For each module, the most significant 
pathway was selected based on adjusted p-values. Ties were resolved by taking the 
smaller pathway (for pathways with large discrepancy in size), or by the most significant 
by unadjusted p-values. 
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of dependence. This tends to inflate the evidence in favour of dependence, but can 
improve the ranking when testing over many pairs of variables. 

Our probabilistic measure has importance advantages over other statistics due to 
its interpretability in terms of a recursive partition of the data space, symmetry over 
those based on Kullback-Leibler divergence. The explicit quantification of a probability 
allows for combining with other sources of information within a prior or meta-analysis. 
As shown in Section 4.2 the Bayesian approach provides a unified method for detecting 
both independence and dependence, something that is not possible without a fully 
probabilistic framework. The Bayesian probabilistic approach allows for the inclusion 
on substantive prior information on the plausibility of an association, which can be 
particularly useful for screening large biological data sets. There is also the possibility 
to embed the model within a hierarchical structure, borrowing strength coherently across 
categories, something that is simply not possible for existing methods based on iron- 
probabilistic methods such as Mutual Information. 
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Appendices 

A1 Details on derivation of the Bayes Factor 

Under Mo, the samples x and y are assumed to be independent; therefore 
p{x, y |H, C, Mo) = p{x\E x ,C, M 0 )p(y\E. Y ,C, Mo) 

= llCv " 2(1 - 

3 i 

where and Sy = {£j,y}j- Since we assumed that the random branching 

probabilities are independent and follow Beta distributions, ~ Beta^^x (o) i a i,Jf,(i)) 

and £yy ~ Beta(a Ji y i (o),aj^y^i)), then for all j, 


P(€j,x\M 0 ) = 


Ci X ’ (0) 




where £?(.,.) denotes the Beta function. We therefore obtain equation ([3| by integrating 
out the random branching probabilities as follows: 

p(x,y\C,Ax,A Y ,M 0 ) = J p{x,y\E,C,A x ,A 0 , M 0 )p(E\M 0 )dE 

^njo+n i 2+a i ,* 1 (o)-l^ _ ^ ^ njl+nj3+aj x (1) - 1 


n 


n 


B( a j,x,(o)i a j,x,(i)) 

_ £. x )n J2 +n i3 +a j ,y, (1) -l 
X ~-iTi- 1 -r- 

■®( a i.y>(o)) a i,y,(i)) 

B(rijo + rij2 + aij iX ,(o)i n ji + n j3 + Q<j,x,(i)) 

B(rijo + Tiji + a j,Y,( o)j + Tij3 + Q!j,y,(i)) 
^( a j.y>(o)i a i,y,(i)) 


d£j,xd£j : Y 


Under hypothesis A4i, we consider a probability vector of random branching prob¬ 
abilities Oj = ( 0 j,(o)> 0 j,(i)) 0 j,( 2 )) 0 j,( 3 )) € § 3 so that 

p(*,»|e,c,A4,) = n«S) e U)»lS)«13, 


where 0 = {6*j}y - Assuming that 9j follows a Dirichlet distribution with parameters 

‘ (2 ); ; 


p(O j \M 1 ) = 


« ,_1 ® ) " 1 W 1 ( 1 - - ^.(D - ^AP))^ 00 " 1 
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where B{.) is the multinomial Beta function, which can be expressed in terms of the 
gamma function: 

_g( a .) = r ( Q i,(Q)) r ( a J.(i)) r ( Q; j,(2)) r ( a: j,(3)) ^ 

3 r ( a i,(0) + a j,(l) + a j,{ 2) + a j,( 3)) 

Similarly to the .Ado case, the marginal likelihood can be obtained by integrating out 
the random branching probabilities as follows: 


p(x,y\C,A,Mi) 


J p(x,y\Q,C, A, Mi)p(Q\M!)dQ 



ji 


dQji — 


j 


B{hj + atj ) 


where hj denotes the vector of counts of data: hj = {n,jo, riji, n.j 2 , rij^). 

To compute the Bayes factor we calculate the ratio of p(x,y\M. q) as defined in 
equation ^ over p(x,y\Mi) as defined in equation Q. Since both ([3]) and Q. can 
be written an infinite product over all possible sets, then the Bayes factor can also be 
written ■ bj where 


k _ B(rijo + nj2 + o), n ji + n j3 + a j, x,(i)) 

x + riji + ajY,(o)i n j2 + Jij3 + a j,Y,( i)) B{ a j) 

B{ a j,Y,(o)> a j,Y,(i)) B(hj + aj) 

Expressing the Beta and the multinomial Beta function in terms of Gamma functions, 
we obtain equation ©• 

It is easy to see that for any j such that rijo + riji + rij 2 + rij 3 = 0, 

r(2a 3 ) 4 r(4a 3 )r(q J -) 4 

J r (4a j )r(a i ) 4 r(2a i ) 4 

In addition, if rijo + riji + rij 2 + rij 3 = 1, then 

T(1 + 2a i ) 2 r(2a J ) 2 r(4a J )r(a i ) 4 
J ~ r(l + 4aj)r(l + a,)r(a,) 3 r(2a,) 4 
r(l + 2 % -) 2 r(4a,)r(a i ) 
r(l + 4a j )r(l + a i )r(2a i ) 2 

(2 aj ) 2 r(2a j -) 2 r(4q,-)r(q j ) 

4 ajT(4a j )a j r{a j )T(2a j ) 2 

= 1 

where in the third line we use that for all t, T(t + 1) = tT(t). 
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A2 Other approaches 


The square Pearson correlation is probably the most commonly used statistic to iden¬ 
tify associations between two samples. This statistic accurately quantifies linear depen¬ 
dences but fails to detect dependencies when relationships are highly nonlinear. An¬ 
other criteria, called mutual information (MI), arose from the field of information theory 
(Shannon and Weaver|1949 Cover and Thomas|1991 1. The original primary purpose of 


mutual information was to quantify bits of informations transmitted in a system. It has 


been used in a wide range of disciplines such as in economy (Maasoumi 1993 

Maasoumi 

and Racine 2002), wireless security (’ 

3loch et al. 2008), pattern analysis 

Peng et al. 

2005), neurobiology (Pereda et al.|200f 

>) and systems biology (Cheong et al. 

2011 Liepe 


a similarity measure between the joint probability of two variables and the product of 
the two marginal probability distributions; for the purposes of this paper considering 
continuous univariate random variables, {x, y }, MI is equivalent to the Kullback-Leiblcr 
divergence between the joint distribution and the product of marginal densities 


MI = 


P{x, y) log 


p(x,y) 

p{x)p{y) 


dxdy . 


It is equal to 0 when the variables are independent and increases with the level of asso¬ 
ciation of the two variables. Therefore, by extension, this measure has been employed 
to quantify dependencies between two variables. It is well-known that the estimation of 
the mutual information between two samples is not straightforward. It is often based 
on approximations of the probability distributions (Paninski 2003 Cellucci et al.||2005 


Khan et al.|2007 ) either using bin-based procedures, kernel densities (Moon et al. 1995), 
or k-nearest neighbours (Kraskov et al.||2004). 


Another classical approach for detecting dependencies between two continuous uni¬ 
variate random variables consists in partitioning the sample space into bins and evalu¬ 
ating non-parametric test statistics on the binned data (Heller et al.]2014). This type of 
approach includes the well known Hoeffding’s test (Hoeffdingj 1948) as well as the max¬ 
imum information criterion (MIC) recently introduced by Reshef et al. (2011). In this 
last paper, the authors propose to estimate the mutual information using a bin-based 
procedure on any grid drawn on the scatterplot of the two variables up to a maximal 
grid precision. The MIC is then proportional to the highest mutual information on 
these grids. This recent work has lead to a series of discussion regarding properties that 
adequate measures of dependencies should fulfil ( Gorfine et al.|2012 Simon and Tibshi- 
rani||2014 Kinney and Atwal||2014 1. More generally, dependences between multivariate 
random variables have commonly been modelled using copula transform (Hoff |2007 
Smith et al.||2012~) or by introducing latent random variables (Petrone et al. 


2009). For 


high-dimensional spaces, the problem of testing independence can also be formulated by 


embedding probability distributions into reproducing kernel Hilbert space (Gretton and 
Gyorfi|20 10). These papers provide innovative approaches to modelling dependence but 
do not provide a fully Bayesian nonparametric approach with an analytic Bayes factor 
for testing dependence vs independence, which is the focus of our paper here. 
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In the following, we consider 5 frequentist dependence statistics: the Rsquared 


measure, the distance correlation (dcor) (Szekely et al. 2009), Hoeffding’s D (Hoeffd- 


ing 1948), the mutual information estimated using the k-nearest neighbour method 


(Kraskov et al. 2004) and the maximum information criterion (MIC) (Reshef et al. 


2011). We use the MATLAB implementation of these algorithms provided in the sup¬ 


plementary material of Kinney and Atwal (20141. The Rsquared measure, the distance 
correlation and Hoeffding’D methods detect higher associations between the variables 
for linear models; whereas MIC and MI detect highest dependencies for the sinusoidal 
and the circular models respectively (see Figure Al). In addition to its probabilistic 
properties, a crucial advantage of our Bayesian approach compared to these five other 
approaches is the interpretability of the measure of dependencies. Indeed, whenever the 
dependence statistic is higher than 0.5 there is, by definition, more evidence in favour of 
the dependence hypothesis than the independent one. In contrast, for all the frequentist 
methods identifying a threshold above which one can claim that two samples are depen¬ 
dent is a challenging task, as opposed to claiming evidence against the null. Typically, 
these thresholds are either chosen heuristically or calibrated via the construction of “null 
datasets” - by randomly permuting the indexes of the two samples in order to destroy 
potential dependence - and defining the threshold as a quantile of the distribution of 
the dependence statistics on these “null” datasets. 


It is difficult to identify a measure that would allow us to fairly compare our Bayesian 
approach to more traditional frequentist approaches. A traditional frequentist measure 
of statistical test performance consists in computing the power of the test which mea¬ 
sures the true positive rate (percentage of times the method detect dependences) for 
a given significance level (i.e. false positive rate). Here, for each dependence test, we 
chose a significance level of 0.05, that is, we fix the detection threshold to be equal to 
the 0.95 quantile of the “null distribution” estimated via 500 permutations. In Fig¬ 
ure [A2j we observe that the power of every method strongly varies from one generative 
model to another. The power of the empirical Bayes version of our procedure (denoted 
by EPT in Figure |A2[ ) is comparable with that of the Mutual Information algorithm 
estimated using the 20-nearest neighbours. Further power analysis via ROC curves is 
shown in Figure |A3| Using a threshold based on the “null distribution” is very atypical 
for Bayesian approaches as there exists a natural threshold for the probabilistic measure 
which is equal to 0.5. Using this threshold the true positive and false positive rates for 
different generative models are summarized in the following table. 



linear 

True positive rate 
parab. sinus, circul. 

check. 

False positive 
rate 

PT, N = 150, o- = 2 

0.82 

0.31 

0.33 

1 

0.82 

0.13 

EPT, N = 150, cr = 2 

0.92 

0.95 

0.97 

1 

1 

0.42 

PT, N = 300, o = 4 

0.45 

0.17 

0.21 

0.91 

0.56 

0.09 

EPT, N = 300, cr = 4 

0.62 

0.81 

0.91 

0.98 

0.96 

0.4 
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Figure Al: Distribution (over 500 independent runs) of the dependence measures 
quantified by the Rsquared correlation (black), the dCor (light grey) and the Hoeffding 
method (dark grey), the MIC (green) and the Mutual Information estimated with the 
k-nearest method for different values of k (blue). Data are generated under the five 
illustrative examples as well as an independent generative model where both x and 
y are vectors of i.i.d. samples from a normal distribution with mean 0 and standard 
deviation 1. Here, N = 150 and a = 2. 
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Figure A2: Power of each algorithm for each model for N = 150, o = 2 (Top row) and 
N = 300, o = 4 (Bottom row). The null distribution is computed via permutation; the 
significance level (i.e false positive rate) is set equal to 0.05. PT stands for Polya Tree 
and EPT designates the empirical version of our approach. 
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Figure A3: ROC curve illustrating the true positive rate as a function of the false 
positive rate for each model and each algorithm for N = 150, a = 2 (Middle row) 
and N = 300, a = 4 (Bottom row). The colour scheme is similar to the one from 
figure |Al| except that the red curve represents our Bayesian nonparametric approach 
and the dashed red curve our approach using an empirical Bayes approach maximizing 
the marginal probability over the shifted partition scheme with data wrapping. For 
clarity, the ROC curve for Rsquared is not shown. 
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Figure A4: Impact of the parameter c when o = 2. (Top) Median (over 500 
runs of the probability of the dependent models as a function of the number of data 
points (N) for different values of c. (Bottom) ROC curve for different values of c when 
N = 300. In both figure types, cr = 2 and the color scheme represents different values 
of c £ {0.1,1, 5,10} where c = 0.1 corresponds to the red line and c = 10 corresponds 
to the yellow line. 
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Figure A5: Impact of the parameter c when a = 4. (Top) Median (over 500 
runs of the probability of the dependent models as a function of the number of data 
points (N) for different values of c. (Bottom) ROC curve for different values of c when 
N = 300. In both figure types, cr = 4 and the color scheme represents different values 
of c £ {0.1,1, 5,10} where c = 0.1 corresponds to the red line and c = 10 corresponds 
to the yellow line. 

























